ABSTRACT



Iterative information retrieval from a large database of 
textual or text-containing documents is facilitated by auto- 
matic construction of faceted representations. Facets are 
chosen heuristically based on lexical dispersion, a measure 
of the number of different words with which a particular 
search expression co-occurs within a given type of lexical 
construct (e.g., a noun phrase) appearing in the document 
set. Words having high dispersion rates represent "facets" 
that may be used to organize the documents conceptually in 
accordance with the search expression, effectively providing 
a concise, structured summary of the contents of a result set 
as well as presenting a set of candidate terms for query 
reformulation. 
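
By way of illustration only, the dispersion measure summarized above may be sketched in Python as follows. The function name and the sample phrase list are hypothetical and chosen for the example; the sketch simply counts, for each word, the number of distinct other words with which it shares a phrase.

```python
from collections import defaultdict


def lexical_dispersion(phrases):
    """Map each word to the number of distinct other words it co-occurs with."""
    partners = defaultdict(set)
    for phrase in phrases:
        words = phrase.lower().split()
        for w in words:
            # Every other word in the same phrase counts as a co-occurrence.
            partners[w].update(x for x in words if x != w)
    return {w: len(p) for w, p in partners.items()}


# Hypothetical noun phrases standing in for those found in a result set.
phrases = ["oak tree", "tree rings", "tree roots", "pine tree", "oak table"]
dispersion = lexical_dispersion(phrases)

# Rank words by dispersion (highest first); ties are broken alphabetically.
ranked = sorted(dispersion.items(), key=lambda kv: (-kv[1], kv[0]))
```

In this toy input, "tree" co-occurs with four different words and would therefore be ranked first as a candidate facet, ahead of "oak" with two.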
METHOD AND APPARATUS FOR AUTOMATIC CONSTRUCTION OF
FACETED TERMINOLOGICAL FEEDBACK FOR DOCUMENT RETRIEVAL

FIELD OF THE INVENTION

The present invention relates to automated document
searching, and in particular to the introduction of
conceptual/terminological structure to a document set based
on textual content.

BACKGROUND OF THE INVENTION

The exponential growth of the Internet has provided
consumers with the ability to access vast quantities of
information, so much, in fact, that guiding consumers to
the information they desire is now an industry. Commercial
"search engines" such as ALTAVISTA, accessible over the
Internet, maintain massive databases of Internet-accessible
documents and accept user queries to search these documents.

The search engine may maintain the documents in an
unstructured form, in which case the user searches by
"keyword." Essentially, the search engine accepts one or
more words that the user considers relevant to the topic of
interest, and electronically identifies documents containing
the entered words. Search sophistication can be increased by
means of Boolean capability, which allows the user to
concatenate search terms into strings in accordance with
operators such as AND and OR. In practice, it is found that
simple keyword queries, while easily composed, tend to
underspecify the set of desired documents (retrieving large
numbers of irrelevant documents). Such problems arise from
the user's lack of knowledge of the subject matter giving rise
to the information need, unfamiliarity with the underlying
document collection and its content with respect to that
need, and the difficulty of translating even a well-defined
need into an effective linguistic formulation.

Current search interfaces typically offer a query-refinement
loop that allows the user to enter the initial search
expression, evaluate the results returned, and then modify
the query by addition of keywords. Evaluating search results
can be a time- and energy-consuming task, however. In
surveying a potentially long list of titles and document
summaries, the user must not only evaluate the likely
relevance of the retrieved documents, but also assess the
likelihood that the database will eventually be able to satisfy
the information need (or part of it); assess the degree to
which the current query formulation has expressed the need;
learn about the information space and the vocabulary used to
describe the domain within this particular database; and
ultimately decide on an appropriate query reformulation
strategy to the extent necessary.

To help the user focus his or her search without this kind
of extensive analysis, the documents may be organized
according to content, allowing the user to browse through a
category of documents or at least to confine a keyword
search within such a category. "Clustering" techniques are
frequently employed to categorize related documents within
a document corpus. But generating the categories and placing
the documents within them is an arduous task. Clustering
can, for example, be accomplished manually, with each
document being individually examined by a clerk who
assigns it to the proper category. Naturally, this approach is
prohibitive for commercial Internet search engines that store
millions of documents.

Clustering can also be performed automatically. "Bottom-up"
and "top-down" clustering techniques utilize algorithms
that generate a hierarchical category structure and assign
each document to one or more categories. These techniques
are computationally demanding, however, and do not
necessarily generate document categories that ultimately prove
meaningful to users.

Another approach to providing users interactive feedback
to assist searching is to display terminology "relevant" to the
search. The difficulty here is two-fold: first, determining
which of the thousands of potentially related terms are likely
to be most useful in this instance for query reformulation
and, second, arranging those terms in some way that helps
to elucidate the search space. A manually constructed
thesaurus or a database of term-to-term correlations derived
from statistical corpus analysis can be used to identify terms
that are semantically or statistically related to terms in a
user's query expression. Alternatively, a result list can be
analyzed at run-time for frequently occurring terms or for
phrases containing query terms. In most cases, the terms are
simply presented as an unstructured list (perhaps ordered
alphabetically or by frequency).

DESCRIPTION OF THE INVENTION

BRIEF SUMMARY OF THE INVENTION

The present invention facilitates searching by extracting,
from a collection of documents within a corpus, terms
representing key informational concepts (herein referred to
as "facets" of the document collection). When the user
performs a keyword or other conventional search, the facets
pertaining to the documents retrieved by the search are
returned to the user along with the documents (which are
generally presented in summary form in a results list). The
facets may be used directly to refine the search, but also
serve to educate the user about the information content of the
document corpus and the result list as these relate to the
information need.

The invention constructs "faceted" representations of
documents by identifying a set of lexical dimensions that
roughly characterize concepts likely to have informational
relevance. It is found that lexical items signifying key
concepts within a domain often tend to co-occur with other
useful concepts within certain syntactic contexts, such as
noun phrases. Consequently, facets are chosen heuristically
based on "lexical dispersion," a measure of the number of
different words with which a particular word co-occurs
within such syntactic contexts. The greater the level of
dispersion (i.e., the more different words with which the
given word appears in the documents within the allowed
syntactic context), the greater is the likelihood that the given
word (along with the lexical constructs in which it occurs)
will represent a useful conceptual category relevant to the
query topic. The facets and their corresponding lexical
constructs effectively provide a concise, structured summary
of the contents of a result set as well as a set of candidate
terms for iterative query reformulation.

Accordingly, in a first aspect, the invention comprises a
method of selecting and organizing documents from a
document corpus in response to a user-provided search
expression. Preferably, the document corpus is first analyzed to
identify potential facets; this is accomplished by searching
the textual content of the documents for lexical constructs
conforming to a selected syntactic pattern, such as a noun
phrase. The lexical constructs, in turn, are examined at query
time to derive dispersion rates for words within the constructs.
The dispersion rates are assumed to indicate the
conceptual relevance of the words to which they relate, and
these words are ranked in accordance with their dispersion
rates.

The user's conventional search is processed in the usual
fashion, returning to the user a list of documents conforming
to the search criteria. The user also receives a list of the
facets contained in the retrieved documents. The facets, and
the lexical constructs within which they appear, may be used
for query reformulation in various ways. The user may, for
example, recognize a particular construct as especially
relevant to the information need and choose to see a list of
documents containing this lexical construct. Alternatively,
the user may choose to augment the original search
expression with a selected word or lexical construct.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing discussion will be understood more readily
from the following detailed description of the invention,
when taken in conjunction with the accompanying drawings,
in which:

FIG. 1 schematically illustrates a representative
environment for the present invention;

FIG. 2 schematically illustrates a server configured for
operation in accordance with the present invention; and

FIGS. 3A and 3B are screen displays illustrating the
operation of a preferred embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The present invention exploits the recognition that in
English, as in other languages, new concepts are often
expressed not as new single words, but as concatenations of
existing nouns and adjectives. While this tendency is
especially noticeable in technical language, where long chains of
nouns ("central processing unit," "byte code interpreter")
are not uncommon, compound terms permeate everyday
language as well. Noun compounds are regularly used to
encode ontological relationships ("oak tree," for example,
specifies a type of tree) as well as other kinds of
relationships: in the term "tree rings," rings are a property of the
tree; in "tree roots," the roots are a part of the tree. One
would therefore expect documents dealing with trees to
contain many different phrases with the word "tree," since
such compounds linguistically serve to identify subordinate
categories, attributes, and other relationships within the
domain of trees.

The present invention exploits the observation that a
word's "lexical dispersion" (i.e., the number of different
terms with which the word appears within certain syntactic
constructions) can be used to identify key concepts, or
"facets," of the document set. In a representative application,
the invention is implemented in an interface to a document
search engine facilitating document searching and browsing
on the World Wide Web (hereafter "web"). More generally,
the invention is useful in a broad range of
information-processing tasks, including data mining, information
filtering, and targeted document retrieval. As used herein,
the term "document" includes virtually any digitally stored
item having textual content (e.g., items such as articles,
papers, statements, correspondence, etc., whose content is
exclusively verbal, or mixed-format items such as illustrated
books or labeled images that contain only some searchable
text).

1. System Organization and Operation

With reference to FIG. 1, an exemplary environment in
which the present invention is implemented comprises a
search engine 100, which is accessible to any of a series of
client computers representatively indicated at 105₁, 105₂,
105₃ communicating with search engine 100 over a
computer network 110. Search engine 100 comprises a server
computer 120 coupled to a large storage device 130, which
maintains the searchable document corpus or database. The
terms "server" and "host" are herein used interchangeably to
denote a central facility consisting of a single computer or
group of computers that behave as a single unit with respect
to the clients 105. In order to ensure proper routing of
messages between the server 120 and the intended client, the
messages are first broken up into data packets, each of which
receives a destination address according to a consistent
protocol, and which are reassembled upon receipt by the
target computer. A commonly accepted set of protocols for
this purpose comprises the Internet Protocol, or IP, which
dictates routing information; and the transmission control
protocol, or TCP, according to which messages are actually
broken up into IP packets for transmission for subsequent
collection and reassembly. TCP/IP connections are quite
commonly employed to move data across the Internet.

FIG. 2 depicts the internal organization of server 120 as
a series of blocks or modules that implement the functions
performed by the server. A user interface module
205 permits the server's operator to interact with and
program the server. A control block 210 contains
computer-executable instructions for implementing the analytical
functions of the invention as described in greater detail
below. A conventional search module 212 performs keyword
or other conventional searches on document database 130 in
response to user-provided queries. The server's operating
system 215 directs the execution of low-level, basic system
functions such as memory allocation, file management and
interaction with document database 130 (FIG. 1). A network
communication block 220 provides programming to connect
with computer network 110, which may be a local-area
network ("LAN"), a wide-area network ("WAN"), or the
Internet. Communication module 220 drives a network
interface 225, which contains data-transmission circuitry to
transfer streams of digitally encoded data over the
communication lines defining network 110.

In the case of Internet connections, data exchange with a
user (or simultaneously with multiple users) is typically
effected over the web by means of web pages. In this case,
server 120 contains a series of web templates, which
are implemented as formatting (mark-up) instructions and
associated data, and/or so-called "applet" instructions that
cause a properly equipped remote computer to present a
dynamic display. Management and transmission of a
selected web page is handled by a web server module 230,
which is conventional in the art.

During a search operation, document text and/or index
data is rapidly cycled in and out of a memory partition or
buffer 235 in accordance with the user's search query and
the operation of the invention. Typically, the user does not
interact directly with server 120, but instead with an
application running on a client machine 105. In this sense, the
term "application" denotes a body of functionality for
obtaining, processing and/or presenting data to a user. As
noted above, the server may support web communications
via the HyperText Transfer Protocol (http), formatting web
pages that it serves over the Internet to clients 105 contacting
the server by means of a web browser. Using conventional
CGI scripting and image-map techniques, the web page
permits a remote user to make selections and submit queries
based on graphical representations; the user selects elements
of the graphical web-page display using a position-sensing
device (typically a mouse), and the web page communicates
the selection (actually, the two-dimensional coordinates of
the selection) and text to the server 120.

The manner in which server 120 processes a search query
is best understood with reference to FIGS. 3A and 3B, which
illustrate representative screen displays generated by the
present invention and viewed by a remote user. In particular,
a web browser (such as the COMMUNICATOR browser
from Netscape Communications Corp. or the
EXPLORER browser from Microsoft Corp.), indicated
generally at 300, runs as an active process on a client machine
105 (FIG. 1). The web page generated by server 120 and
viewed by the user over browser 300 comprises a series of
four frames 305, 310, 315, 320. Frame 305 is a text box in
which the user may enter natural-language queries for
processing by the invention. Below that, results are
displayed in frame 310. Frames 315, 320 display terminology.
Scrollable upper frame 315 contains a predetermined
number (10 in the illustration) of select boxes, each labeled with
a facet. As shown in FIG. 3B, clicking the arrow on the right
side of a select box causes it to expand into a
frequency-sorted menu of phrases containing the facet.

FIG. 3A shows a representative screen resulting from the
user's submission of the search query "Indonesia." The
items listed in window 310 are identified by server 120
according to a conventional keyword search process through
the contents of database 130. The facets presented in box
315 are obtained by control block 210 (FIG. 2) based on
lexical analysis of the located documents listed in box 310.
In other words, the user's query drives a conventional,
coarse search through the document corpus; and lexical
analysis of the thus-located documents provides the facets
available in box 315. As described in greater detail below,
operational preference generally dictates performing the
lexical analysis on the entire document collection (e.g.,
before queries are accepted or during system down time) and
not each time documents are located by the coarse search. A
mapping between documents and facets is retained,
however, so that the facets presented to the user do, in fact,
derive from the documents located in response to the query.
This approach substantially reduces system response time
and avoids duplicative lexical analyses.

With reference to FIG. 3A, the facets corresponding to the
located documents summarized in result window 310 are
ranked according to lexical dispersion in box 315. The facets
include the words indonesian, export, and government, all of
which reflect generic topics that occur in a variety of more
specific contexts throughout the result list.

FIG. 3B presents the set of contents for the query
"cooking." Clicking on the term "cheeses" reveals the phrases
listed in the resulting pull-down box 325. These "facet
phrases" conform to the lexical construct used to identify the
facets and represent the most frequently encountered phrases; of
course, the designer can choose (or allow the user to choose)
to list any desired number of facet phrases in response to the
user's click of a facet selection box. The frequency of a
facet's occurrence is preferably measured, as described
below, with respect to the documents retrieved in response
to the user's query rather than the document corpus as a
whole.

While the facets and phrasally associated concepts
certainly do not constitute an exhaustive catalogue of the
contents of the result list, they are nevertheless conceptually
informative, in effect providing a surrogate table of contents
from which the user can gain some insight into the subject
matter of the result set without wading through the title list
or examining sample articles.

The user can respond to the list of facets and facet phrases
in several ways. Preferably, the invention is configured to
facilitate iterative search by combining a selected facet
and/or facet phrase with the original search query. For
example, in a preferred approach, the invention allows the
user to select a facet phrase from a pull-down list 325, and
upon receipt of the user's selection, control block 210
forms a new query combining (1) the
original search query, (2) the facet phrase (as a single
contiguous term for search purposes), and (3) the individual
components of the facet phrase. This approach is especially
useful in conjunction with search engines whose ranking
algorithms grant rarer terms higher weights, since this will
ensure that documents containing the full facet phrase will
rise to the top of the result list. By also including
components of the phrase in the new query, their individual
relevances will contribute to the search results as well.

The user may then select a different facet phrase, in
response to which control block 210 replaces the previously
chosen facet phrase with a new query based on the newly
selected facet phrase, and causes a search to be executed
based on the new query. Other modes of query reformulation
based on facets and facet phrases are of course possible. For
example, the user may select just the facet term itself, which
is added to the original search string, imparting greater focus
to the reformulated search; or may instead replace the
original search expression with the facet term. A new list of
facets is not necessarily generated in response to successive
search reformulations; retaining the original facet list
preserves a stable feedback context across multiple query
refinements.

As shown in FIG. 3A, the user may enter a new search
expression in box 320. Upon receipt of this term, control
block 210 searches the search result list (shown in box 310)
for phrases containing the new search expression and
conforming to the syntactic pattern used to identify facets. The
located phrases (or a subset thereof) are listed in a pull-down
list, which the user may select as discussed above with
respect to pull-down list 325. For example, in the case of the
query "wildlife extinction," a user might enter the word
"bill" in box 320 to obtain a listing of legislative bills
pertaining to the topic.

2. Facet Identification

As noted above, facets and facet phrases are preferably
identified for the entire collection of documents (with new
facets identified on an ongoing basis as additional
documents are added) in database 130. Key to this process is
selection of the syntactic pattern(s) upon which the
dispersion analysis is to be based. Control block 210 searches
through the documents in database 130 for lexical constructs
conforming to this pattern (i.e., instances of the pattern), and
performs a dispersion analysis to measure the number of
different words with which a particular word co-occurs
within the located lexical constructs.

A large number of syntactic patterns can encode valid
semantic relationships. As mentioned above, noun
compounds encode strong, long-lived information relating to
relationships and attributes, and may therefore serve well as
the lexical constructs from which dispersion rates are
derived. Another advantage of using noun compounds is the
ease with which they may be identified; a pattern matcher
scanning a tagged document corpus can readily detect
sequences of nouns.

Nonetheless, nominal concepts may be connected by
verbs or prepositions, which often express the same
relationships encoded in noun compounds. Even simple
co-occurrence within a sentence, paragraph, or document
often implies some semantic relationship. And not all useful
semantic relationships are encoded as compounds; limiting
the search to noun compounds may lead to informational
gaps.

As adjectives are commonly used to specialize concepts, 
we have found it useful to also search for noun phrases 
preceded by an adjective modifier (e.g., "international law"
or "ancient history"). This construct also captures com- 
pounds formed with the adjectival form of a morphologi- 
cally related noun (e.g., "French hotel," "literary criticism"). 
Many nouns and adjectives that frequently occur in phrases 
have little value for information retrieval, however. A noise-word
filter can be used to ensure that such candidate phrases
are ignored. Typical filtered terms include quantitative nouns 
and adjectives, such as cardinal or ordinal numbers; words 
such as "many," "some," and "amount"; temporal nouns 
such as "year"; and qualitative adjectives such as "signifi- 
cant" and "reasonable." 
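
By way of illustration only, a noise-word filter of the kind just described might be sketched in Python as follows. The word list, the numeric pattern, and the rule that a phrase is dropped when any one of its terms is a noise word are assumptions made for the example; they are not prescribed by the description above.

```python
import re

# Illustrative list drawn from the examples in the text plus a few cardinal
# and ordinal words; a working list would be much longer (assumption).
NOISE_WORDS = {
    "many", "some", "amount", "year", "significant", "reasonable",
    "one", "two", "three", "first", "second", "third",
}

# Digits, optionally followed by an ordinal suffix ("3", "3rd", "10th").
NUMERIC = re.compile(r"\d+(st|nd|rd|th)?")


def is_noise(term):
    """Return True if a single term should be treated as a noise word."""
    t = term.lower()
    return t in NOISE_WORDS or NUMERIC.fullmatch(t) is not None


def filter_phrases(phrases):
    """Keep only phrases in which no term is a noise word (assumed rule)."""
    return [p for p in phrases if not any(is_noise(t) for t in p.split())]


candidates = ["ancient history", "many nations", "significant amount",
              "3rd quarter", "French hotel", "year end"]
kept = filter_phrases(candidates)
```

Note that the sketch matches surface forms only, so an inflected form such as "years" would pass unless it is added to the list or the terms are first reduced to their base forms.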

In operation, control block 210 first constructs a facet 
database from document database 130 prior to query-time. 
Runtime selection of facets and values is performed when a 
user has entered a query to return a set of facets pertinent to 
the result list. 

Construction of the facet database can be accomplished in
accordance with the following steps, applied to each
document in the collection. The procedure as described uses
noun phrases containing at most one adjective to identify facets.

1. Tokenize the document. This refers to separating the text 
into terms or "tokens" using a set of predefined "delim- 
iters" appropriate to the documents. In the case of English 
text documents, these delimiters would include spaces, 
periods, commas, and other punctuation so that each 
token corresponds to a single word or term. 

2. Tag each token with a "part of speech" tag indicating the 
syntactic category (noun, adjective, preposition, etc.) of 
each token. 

3. Extract all phrases composed of sequences of tokens that
match the syntactic pattern ?<adjective><noun>+ and
have a total length between 2 and, for example, 5 tokens.
In the foregoing formula, the symbol "?" refers to an
optional term, so that the formula specifies a single
adjective or no adjective. The symbol "+" indicates that
one or more of the preceding term category is allowed;
together with the two-token minimum, the formula thus
specifies a phrase containing at least two words
the last of which is a noun.

4. Remove any phrases that match a noiseword filter. 

5. For each remaining phrase, 

a. If the last term in the phrase (i.e., the head noun) is
lower case, then replace it with its morphologically
uninflected form. This essentially canonicalizes plural
head nouns to their singular forms. For example, 
"Beethoven symphonies" would be transformed into 
"Beethoven symphony." 

b. For each term in the canonicalized phrase, create a
"facet tuple" in which the first element (the facet) is the
term and the second element (the facet phrase) is the 
full phrase. There is one exception, designed to elimi- 
nate facets likely to be first names: if all items in the 
phrase are capitalized, then the first term of the phrase 
is not used to compose a facet tuple. However, facet 
tuples are composed for the remaining phrase elements. 

6. Create a file containing the list of facet tuples so created 
and preserve a mapping between the source document and 
the corresponding facet file. For example, if each source 
document is assigned a unique document identifier, then
a directory of facet files could be built, with each file
named according to its corresponding document identifier.
Accordingly, if there is a source file with doc_id 26
containing the text "The jazz band played several long 
symphonic pieces by Leonard Bernstein," there should be 
a corresponding facet file containing the following facet 
tuples: 

<jazz, jazz band> 
<band, jazz band> 
<symphonic, symphonic piece> 
<piece, symphonic piece> 
<Bernstein, Leonard Bernstein> 
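
By way of illustration only, steps 1 through 5 above may be sketched in Python as follows. Several elements are assumptions made for the example: the toy lexicon stands in for a real part-of-speech tagger, the de-pluralizer is deliberately naive, the noise-word set is abbreviated, and only maximal, non-overlapping matches of the pattern are considered. Writing facet files to disk (step 6) is omitted.

```python
import re

# Hypothetical stand-in for a part-of-speech tagger: a tiny lexicon covering
# only the example sentences. A = adjective, N = noun, X = anything else.
TOY_LEXICON = {
    "jazz": "N", "band": "N", "several": "A", "long": "A",
    "symphonic": "A", "pieces": "N", "leonard": "N", "bernstein": "N",
    "beethoven": "N", "symphonies": "N",
}
NOISE_WORDS = {"many", "some", "several", "amount", "year"}  # abbreviated
PATTERN = re.compile(r"A?N+")  # optional adjective, then one or more nouns


def singularize(noun):
    """Naive de-pluralizer (assumption); a real system would use a lemmatizer."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def facet_tuples(text, min_len=2, max_len=5):
    """Return the (facet, facet phrase) tuples for one document's text."""
    tokens = re.findall(r"[A-Za-z]+", text)                 # step 1: tokenize
    # Step 2: one tag character per token, so string offsets equal token offsets.
    tags = "".join(TOY_LEXICON.get(t.lower(), "X") for t in tokens)
    tuples = []
    for m in PATTERN.finditer(tags):                        # step 3: match pattern
        phrase = tokens[m.start():m.end()]
        if not min_len <= len(phrase) <= max_len:
            continue
        if any(t.lower() in NOISE_WORDS for t in phrase):   # step 4: noise filter
            continue
        if phrase[-1].islower():                            # step 5a: canonicalize head
            phrase[-1] = singularize(phrase[-1])
        full = " ".join(phrase)
        # Step 5b: skip the first term when every term is capitalized.
        all_capitalized = all(t[0].isupper() for t in phrase)
        terms = phrase[1:] if all_capitalized else phrase
        tuples.extend((t, full) for t in terms)
    return tuples


doc_26 = ("The jazz band played several long symphonic pieces "
          "by Leonard Bernstein")
tuples_26 = facet_tuples(doc_26)
```

For the doc_id 26 sentence, the sketch yields the same five tuples listed above. "Long" is skipped because the pattern admits at most one adjective, so the match begins at "symphonic."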

Runtime selection of facets and values can be accomplished
in accordance with the following steps:

1. Define the following parameters with respect to the
ranked document set D computed as the result list for the
query Q:

a. OCC (occurrence cut-off): the minimum number of documents
in D in which a facet tuple must occur.

b. DISP (dispersion cut-off): the number of different
unique phrases with which a facet occurs within D.

c. FCAND_DOCS (facet candidate documents cut-off):
the number of documents from the ranked set D from
which candidate facets are to be extracted.

d. FVAL_DOCS (facet value documents cut-off): the
number of documents from the ranked set D from
which facet values are to be extracted.

e. FFDOC (facets from document cut-off): the number of
facet tuples to be considered from any one document.
Facet tuples from the beginning of the document up to
this cut-off are considered.

f. FCAND (facet candidate cut-off): the number of facet
candidates to consider based on "dispersion" alone.

g. FFINAL (facet final cut-off): the number of facet
candidates to output after ranking by "tf.idf" (i.e., the
well-known term frequency/inverse document frequency
weighting algorithm, described in greater detail
below).

h. PHR (phrase cut-off): the maximum number of
frequency-ranked phrases to be retained for any one
facet.

i. CSIZE (collection size): the number of documents in the
total collection (of which D is a subset).

2. For the first FCAND_DOCS documents in the ranked
result list D, construct a list of the first FFDOC facet 
tuples from each of the corresponding facet files. 

3. Compute the number of occurrences of each facet tuple in 
this list and remove those with number of occurrences 
<OCC. 

4. For each facet in this list, compute a frequency-sorted list 
of the phrases that appear in facet tuples with the facet. 
The number of different unique phrases occurring for each 
facet will be referred to as its "dispersion." For each 
phrase associated with a facet, retain its frequency (i.e., 
the number of documents in the first FCAND_DOCS documents
in D in which it occurred) and the phrase itself. So, at this 
point a list of facet information has been generated in the 
following form: {facet dispersion {phrase 1 freq} {phrase 
2 freq} . . . } 

5. Sort the set of facets by their dispersions and retain the top 
FCAND facets along with their facet information 
(computed in the previous step). Call this set of facets the 
"facet candidates." 
6. Compute a weight for each facet using a weighting
measure. Such a measure should weigh facets more
highly if they appear more frequently in the document set
D (the "spread" component of the weight) but less
frequently in the database overall (the idf component of the
weight). Weight measures may be computed as follows:

a. Method 1 ("dispersion")
Let weight = dispersion of the facet

b. Method 2 ("spread")
Let sp be the number of documents in the first
FCAND_DOCS articles of the result list in which a
facet tuple for the facet exists
Let dispersion_limit be a number representing a
particular dispersion size (e.g., 5)
Let weight be computed as sp*100 if dispersion of the
facet > dispersion_limit
Let weight be computed as dispersion if dispersion of
the facet < dispersion_limit

c. Method 3 ("spread.icf")
Let sp be the number of documents in the first
FCAND_DOCS articles of the result list in which a
facet tuple for the facet exists
Let cf be the total number of occurrences of the facet
term within the document collection
Let icf be computed as CSIZE/cf
Let wf1 be an attenuation factor for sp
Let wf2 be an attenuation factor for icf
Let weight be computed as

(wf1 + (100*sp/FCAND_DOCS)) * (wf2 + ln(icf))

7. Rank the facet candidates by their weight and choose the
top FFINAL facets as "facet finalists."

8. For the first FVAL_DOCS documents in D, construct a
list of the first FFDOC facet tuples from each of the
corresponding facet files for which the facet value of the
tuple is one of the facet finalists.

9. For each facet remaining in this list of facet tuples,
compute a frequency-sorted list of all the phrases that
appear in facet tuples with the facet.

10. Merge this list with the information in the facet finalists.
As a result of the merge, each facet will have a new
dispersion value and set of phrases. For each facet, sort

It is also possible to impose a "line cut-off" value to limit
candidate facets to those occurring within the first n lines of
a document. This is useful where the collection contains
long documents in which topics change, so many of the
phrases in the article may be unrelated to the query. A line
cut-off is particularly useful in the case of search engines
whose ranking strategies favor terms appearing near the
beginning of articles, so that query terms are more likely to
occur near the front of highly ranked articles. As a result,
unrelated phrases may be heuristically eliminated by
considering only those appearing in the first n document lines.

The foregoing approach generates facets in two passes.
First, lexical dispersion is used to select a set of n (e.g., 50)
candidate facets. Then a ranking measure is used to re-rank
them. Spread.icf is a variant of tf.idf, a conventional
algorithm commonly used to weight terms based on a
combination of their density within a single document (term
frequency, or tf) and their rarity within the document
collection as a whole (inverse document frequency, or idf).
See, e.g., Salton, "Another Look at Automatic Text-Retrieval
Systems," Comm. of ACM, 29:7, p. 648-56 (1986), which is
hereby incorporated by reference. Applying this weighting
has the desired effect of promoting facets that occur more
widely throughout the result set, while demoting terms that
are too prevalent throughout the corpus as a whole.

It will therefore be seen that we have invented an
approach to information retrieval that appears to a user to be
conceptually based but is implemented in an automatic
fashion; it allows text searchers to refine their searches and
become acquainted with the informational content of a
document database. The terms and expressions employed
herein are used as terms of description and not of limitation,
and there is no intention, in the use of such terms and
expressions, of excluding any equivalents of the features
shown and described or portions thereof, but it is recognized
that various modifications are possible within the scope of
the invention claimed.

What is claimed is:

1. A method of selecting and organizing documents from
a document corpus in response to a user-provided search
expression, the method comprising the steps of:

a. locating, within the document corpus, documents
matching the search expression;

the set of phrases by frequency and retain the top PHR b identifying, within the located documents, instances of 

phrases. a lexical construct conforming to a selected syntactic 

11. Sort the facets alphabetically. Return these facets and 45 pattern thal foa^ two or more parts of speech; 

their associated phrases. c assigning dispersion rates to words within the lexical 

„ , „ „ constructs, each dispersion rate corresponding to the 

Changing various parameters can markedly affect the numbef of textuall distinct lexical constructs 

selection of facets. For example, the facet candidate docu- • tQe W0K j. 

ments cut-off factor can have a significant effect on the 50 , , . ' , . . j« 

uivui* vui uxi. u ^u* d. ranking the words in accordance with their dispersion 

eventual facet output; since the result list is ranked (by the & r 

search engine's ranking algorithm), the relevance of facets ' . -_ - 

to the query can be increased by limiting the extracted e ' f"™ 8 1 ^ ° f * ^ ^ ° f WOrdS; 

phrases to highly ranked documents. The optimal ranking f - facilitating selection of a listed word; 

cut-off may differ from query to query. 55 g- appending the word to a base search expression to form 

The facet candidate cut-off— i.e., the number of top- a new expression; and 
ranked facet candidates from the first pass to be re-ranked, h. facilitating access to documents from the corpus match- 
in accordance with the preferred implementation, by ing the new search expression, 
spread.icf— reflects a trade-off between two measures of 2. The method of claim 1 wherein the base search 
what consititutes a good facet. On the one hand, a facet 60 expression is the user-provided expression, 
should have many phrases associated with it. On the other 3. The method of claim 1 wherein the base search 
hand, it should be a term that is particularly relevant to the expression is a new user-provided expression, 
result set (and not simply a common term throughout the 4. A method of selecting and organizing documents from 
database). Good overall results have been achieved on a document corpus in response to a user-provided search 
intranet databases using a facet candidate cut-off value of 50 65 expression, the method comprising the steps of: 
along with both the spread and the spread.icf ranking a. locating, within the document corpus, documents 
strategies. matching the search expression; 
  b. identifying, within the located documents, instances of a lexical construct conforming to a selected syntactic pattern that includes two or more parts of speech;
  c. assigning dispersion rates to words within the lexical constructs, each dispersion rate corresponding to the number of textually distinct lexical constructs containing the word;
  d. ranking the words in accordance with their dispersion rates;
  e. facilitating selection of at least one of the words;
  f. for each selected word, presenting a sorted list of lexical constructs that appear in the located documents and contain the word; and
  g. facilitating selection of a listed lexical construct.

5. The method of claim 4 further comprising the steps of:
  d. appending the lexical construct to a base search expression to form a new search expression; and
  e. facilitating access to documents from the corpus that match the new search expression.

6. The method of claim 4 further comprising the step of facilitating access to documents from the corpus that match the selected lexical construct.

7. The method of claim 4 wherein the base search expression is the user-provided query.

8. The method of claim 4 wherein the base search expression is a new user-provided query.

9. The method of claim 1 wherein the syntactic pattern is ?<adjective><noun>+.

10. The method of claim 1 further comprising the step of removing lexical constructs matching a noiseword filter.

11. Text-searching apparatus comprising:
  a. a digitally searchable corpus of documents;
  b. an interface for receiving a search expression;
  c. a search module, responsive to the search expression, for locating documents in the corpus matching the search expression; and
  d. a control module configured to:
    i. identify, within the located documents, instances of a lexical construct conforming to a selected syntactic pattern that includes two or more parts of speech;
    ii. assign dispersion rates to words within the lexical constructs, each dispersion rate corresponding to the number of textually distinct lexical constructs containing the word; and
    iii. rank the words in accordance with their dispersion rates.

12. The apparatus of claim 11 wherein the user interface is a web page and further comprising:
  a. a web server;
  b. means for generating web pages for transmission to remote users via the server;
  c. means for receiving, via the server, the search expression from a generated web page.

13. The apparatus of claim 11 wherein the interface is configured to:
  a. present a list of at least some of the words; and
  b. facilitate selection of a listed word and append the selected word to a base search expression to form a new search expression, the search module receiving the new search expression and, in response thereto, locating documents from the corpus matching the new search expression.

14. The apparatus of claim 13 wherein the initial search expression is the user-provided expression.

15. The apparatus of claim 13 wherein the initial search expression is a new user-provided expression, the interface being configured to receive the new user-provided expression.

16. The apparatus of claim 11 wherein the control module is further configured to:
  a. facilitate selection of at least one of the words; and
  b. present, for each selected word, a sorted list of lexical constructs that appear in the located documents and contain the words, the interface being configured to present the list and facilitate selection of a lexical construct from the list.

17. The apparatus of claim 16 wherein the interface, in response to selection of a lexical construct, is further configured to append the lexical construct to an initial search expression to form a new search expression, the search module being responsive to the new search expression and locating documents from the document database matching the new search expression.

18. The apparatus of claim 17 wherein the initial search expression is the user-provided expression.

19. The apparatus of claim 17 wherein the initial search expression is a new user-provided expression, the interface being configured to receive the new user-provided expression.

20. The apparatus of claim 11 wherein the syntactic pattern is ?<adjective><noun>+.

21. The apparatus of claim 11 wherein the control module is further configured to remove lexical constructs matching a noiseword filter.

* * * * *
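
By way of illustration only, the dispersion rate recited in the claims and the "spread" and "spread.icf" weights described in the specification may be sketched in Python as follows. This is a minimal sketch, not the implementation of the invention. The function names and the sample phrases are hypothetical. It assumes the lexical constructs have already been extracted as plain strings, that a dispersion exactly equal to dispersion_limit falls in the lower branch of Method 2 (the text leaves that case open), and that the spread.icf weight has the form (wf1 * (100 * sp / FCAND_DOCS)) * (wf2 + ln(icf)) with icf = CSIZE/cf.

```python
import math
from collections import defaultdict


def dispersion_rates(phrases):
    """Map each word to the number of textually distinct phrases containing it."""
    distinct = defaultdict(set)
    for phrase in phrases:
        words = tuple(phrase.split())
        for word in set(words):
            # A set of word tuples, so a repeated phrase is counted only once.
            distinct[word].add(words)
    return {word: len(found) for word, found in distinct.items()}


def weight_spread(dispersion, sp, dispersion_limit=5):
    """Method 2 ("spread"): sp*100 above the limit, else the dispersion itself."""
    if dispersion > dispersion_limit:
        return sp * 100
    # Assumption: equality with the limit is treated like the "<" case.
    return dispersion


def weight_spread_icf(sp, cf, csize, fcand_docs, wf1=1.0, wf2=1.0):
    """Method 3 ("spread.icf"), under the assumed form stated in the lead-in."""
    icf = csize / cf
    return (wf1 * (100 * sp / fcand_docs)) * (wf2 + math.log(icf))


# Hypothetical noun phrases taken from a result set.
phrases = ["optical disk", "hard disk", "disk drive", "hard disk", "tape drive"]
rates = dispersion_rates(phrases)

# Rank words by descending dispersion, breaking ties alphabetically.
ranked = sorted(rates, key=lambda word: (-rates[word], word))
print(ranked)
```

In this sketch "hard disk" appears twice but contributes only once to the dispersion of "hard" and of "disk", reflecting the requirement that the lexical constructs be textually distinct.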